Building Comparable Corpora Based on Bilingual LDA Model

نویسندگان

  • Zede Zhu
  • Miao Li
  • Lei Chen
  • Zhenxin Yang
چکیده

Comparable corpora are important basic resources in cross-language information processing. However, the existing methods of building comparable corpora, which use intertranslate words and relative features, cannot evaluate the topical relation between document pairs. This paper adopts the bilingual LDA model to predict the topical structures of the documents and proposes three algorithms of document similarity in different languages. Experiments show that the novel method can obtain similar documents with consistent topics own better adaptability and stability performance.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Word Co-occurrence Counts Prediction for Bilingual Terminology Extraction from Comparable Corpora

Methods dealing with bilingual lexicon extraction from comparable corpora are often based on word co-occurrence observation and are by essence more effective when using large corpora. In most cases, specialized comparable corpora are of small size, and this particularity has a direct impact on bilingual terminology extraction results. In order to overcome insufficient data coverage and to make ...

متن کامل

Detecting Highly Confident Word Translations from Comparable Corpora without Any Prior Knowledge

In this paper, we extend the work on using latent cross-language topic models for identifying word translations across comparable corpora. We present a novel precisionoriented algorithm that relies on per-topic word distributions obtained by the bilingual LDA (BiLDA) latent topic model. The algorithm aims at harvesting only the most probable word translations across languages in a greedy fashio...

متن کامل

Building bilingual terminologies from comparable corpora: the TTC TermSuite

In this paper, we exploit domain-specific comparable corpora to build bilingual terminologies. We present the monolingual term extraction and the bilingual alignment that will allow us to identify and translate high specialised terminology. We stress the huge importance of taking into account both simple and complex terms in a multilingual environment. Such linguistic diversity implies to combi...

متن کامل

A Combination of Models for Bilingual Lexicon Extraction from Comparable Corpora

In this paper we present a method to extract bilingual terminologies from comparable non-aligned corpora, by using multiple linguistic knowledge sources, such as: non-parallel corpora, bilingual thesauri, a preliminary bilingual dictionary, etc... We focus on two core technologies: bilingual lexicon extraction from comparable corpora and expansion through thesauri categories based on different ...

متن کامل

Looking at Unbalanced Specialized Comparable Corpora for Bilingual Lexicon Extraction

The main work in bilingual lexicon extraction from comparable corpora is based on the implicit hypothesis that corpora are balanced. However, the historical contextbased projection method dedicated to this task is relatively insensitive to the sizes of each part of the comparable corpus. Within this context, we have carried out a study on the influence of unbalanced specialized comparable corpo...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2013